There are a total of 4412 publications in the corpus. This amounts to 92,970,837 total tokens (including punctuation marks) or 51,777,615 character tokens (excluding punctuation marks). The raw corpus takes up 615 MB; the formatted, tokenized and NLP-processed corpus takes up 6.1 GB. The median work in the corpus is 3994 words long, the longest work is 622,692 words long and the shortest work is 51 words long. Works with fewer than 50 tokens were excluded, as they usually contained no ready-made OCR layer. The distribution of works by their length is given below. The interactive graph allows you to see the title of a work by hovering over it.
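The length figures above can be reproduced from a list of per-work word counts. The following is a minimal sketch rather than the code used to build the corpus: the function name `length_summary` and the toy word counts are assumptions made for illustration, and only the 50-token threshold comes from the text.

```python
from statistics import median

MIN_TOKENS = 50  # exclusion threshold stated in the text


def length_summary(word_counts, min_tokens=MIN_TOKENS):
    """Drop works shorter than min_tokens, then summarise the rest."""
    kept = [n for n in word_counts if n >= min_tokens]
    return {
        "works": len(kept),
        "median": median(kept),
        "shortest": min(kept),
        "longest": max(kept),
    }


# Toy word counts for illustration only, not the real corpus.
summary = length_summary([12, 51, 800, 3994, 622692])
print(summary)
```

The median is used rather than the mean because a few very long works would otherwise dominate the summary.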
There are a total of 8062 publications in ENB out of 39442 total publications in 1800-1940 (20.4%). 4412 of them are included in the corpus (some with multiple sources included), or 11.2% of the total published works in 1800-1940. The distribution in time is depicted below. At the time the corpus was compiled, the remaining linked works were either not digitized in a form suitable for inclusion or proved inaccessible. This situation may change in the future, and new texts from the linked set can then be added.
There are a total of 1188 unique authors represented in the corpus. 40 authors have at least 10 works in the corpus, 102 have at least 5. The number of works for the 20 most prolific authors in the dataset is given below.
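Counts of this kind can be derived from a simple list with one author name per work. The sketch below assumes exactly that input; the function name `author_summary` and the toy author labels are hypothetical and do not reflect the corpus metadata format.

```python
from collections import Counter


def author_summary(authors, thresholds=(5, 10), top_n=20):
    """Count works per author and summarise the distribution."""
    counts = Counter(authors)
    # Number of authors with at least t works, for each threshold t.
    at_least = {
        t: sum(1 for c in counts.values() if c >= t) for t in thresholds
    }
    return {
        "unique": len(counts),
        "at_least": at_least,
        "top": counts.most_common(top_n),
    }


# Toy data: one entry per work, labelled by a placeholder author name.
works = ["A"] * 10 + ["B"] * 5 + ["C"] * 2 + ["D"]
result = author_summary(works, top_n=2)
print(result)
```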
By genre, the distribution within the corpus is as follows. Most genres have 10-20% representation in the corpus. Poetry is slightly better represented, scholarly works slightly less so.
The most popular places of publication for the texts in the corpus are given below.
The genre distribution within the corpus is given below.
ENB metadata also sometimes has information on print numbers. Within the corpus, 1741 works (39.5%) have some measure of print numbers. In the ENB as a whole for 1800-1940, 10660 works (27%) include this information. Their distribution over time is given below.
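The percentages quoted in this section follow directly from the counts given in the text. As a quick arithmetic check (the helper name `share` is an assumption; the counts are the ones stated above):

```python
def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


ENB_TOTAL = 39442       # all ENB publications, 1800-1940
ENB_SUBSET = 8062       # the 8062 publications mentioned above
IN_CORPUS = 4412        # publications included in the corpus
CORPUS_PRINTS = 1741    # corpus works with print-number information
ENB_PRINTS = 10660      # ENB works with print-number information

subset_pct = share(ENB_SUBSET, ENB_TOTAL)
corpus_pct = share(IN_CORPUS, ENB_TOTAL)
corpus_prints_pct = share(CORPUS_PRINTS, IN_CORPUS)
enb_prints_pct = share(ENB_PRINTS, ENB_TOTAL)
print(subset_pct, corpus_pct, corpus_prints_pct, enb_prints_pct)
```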
As a result, 4608 unique files were collected, consisting of 4412 unique texts. Files with fewer than 50 tokens of text, and files where less than 1% of the tokens could be recognized by the EstNLTK lemmatizer, were excluded, as this would indicate very poor OCR quality in the automatically digitized versions. After this filtering, 4608 unique files remained, representing 4412 unique publications (some publications had several digital copies). Of these, the 142 publications that had exactly the same title and author as an earlier publication in the corpus were also excluded, since such a text is likely to be a reprint that largely overlaps with the earlier edition. This leaves 4270 publications in the set.
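The filtering steps described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the pipeline actually used: the `TextFile` record, the function names, and the `is_recognized` callback are all hypothetical, with the callback standing in for whatever check the EstNLTK lemmatizer provides, and each publication is assumed to have a single file. Only the two thresholds and the "same title and author" rule come from the text.

```python
from dataclasses import dataclass

MIN_TOKENS = 50        # files with fewer tokens are excluded
MIN_RECOGNIZED = 0.01  # at least 1% of tokens must be recognized


@dataclass
class TextFile:
    title: str
    author: str
    year: int
    tokens: list


def recognized_share(tokens, is_recognized):
    """Proportion of tokens accepted by the is_recognized callback."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_recognized(t)) / len(tokens)


def passes_quality(f, is_recognized):
    """Apply the token-count and recognition thresholds."""
    return (
        len(f.tokens) >= MIN_TOKENS
        and recognized_share(f.tokens, is_recognized) >= MIN_RECOGNIZED
    )


def drop_reprints(files):
    """Keep only the earliest file for each exact (title, author) pair."""
    seen = set()
    kept = []
    for f in sorted(files, key=lambda f: f.year):
        key = (f.title, f.author)
        if key not in seen:
            seen.add(key)
            kept.append(f)
    return kept


# Toy vocabulary and files, for illustration only.
vocabulary = {"ja", "on"}
files = [
    TextFile("Title A", "Author A", 1920, ["on"] * 60),   # later reprint
    TextFile("Title A", "Author A", 1900, ["ja"] * 60),   # kept
    TextFile("Title B", "Author B", 1905, ["ja"] * 10),   # too short
    TextFile("Title C", "Author C", 1910, ["xq"] * 200),  # nothing recognized
]
good = [f for f in files if passes_quality(f, vocabulary.__contains__)]
kept = drop_reprints(good)
print([(f.title, f.year) for f in kept])
```

Matching on exact title and author is deliberately strict; any normalization of spelling variants would be an additional design decision not described in the text.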
The file quality can be assessed by looking at the proportion of words that were recognized by the NLP workflow; here EstNLTK 1.4 was used. Each era has a baseline level of recognition, which rises as the orthography becomes increasingly modern. However, digitization errors or editing towards a modern standard can move a text away from that baseline. We can examine this visually by plotting the proportion of words recognized over time. We can see the transition from the older writing system to a more modern one in the 1870s, with some texts lagging behind. We can also see the transition from w to v, which interferes with many NLP pipelines; variation between texts is greater here, as both versions remained in use for some time. Some of the variation is due to digitization errors, and we can use this graph to explore where particular texts are situated. Depending on the research question, texts with too many errors may need to be excluded from analysis.
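One way to make the idea of an era baseline concrete is to take the median recognition share per decade and flag texts that fall far from it. The sketch below is an assumption-laden illustration: the decade grouping, the function names, and the tolerance of 0.15 are arbitrary choices made here, not values from the corpus documentation.

```python
from collections import defaultdict
from statistics import median


def decade_baselines(records):
    """records: iterable of (year, share). Return {decade: median share}."""
    by_decade = defaultdict(list)
    for year, share in records:
        by_decade[year // 10 * 10].append(share)
    return {decade: median(shares) for decade, shares in by_decade.items()}


def flag_outliers(records, tolerance=0.15):
    """Return records whose share deviates from the decade median
    by more than tolerance, in either direction."""
    baselines = decade_baselines(records)
    return [
        (year, share)
        for year, share in records
        if abs(share - baselines[year // 10 * 10]) > tolerance
    ]


# Toy (year, recognized share) pairs, for illustration only.
records = [
    (1861, 0.50), (1865, 0.55), (1868, 0.20),
    (1891, 0.90), (1895, 0.88), (1899, 0.92),
]
baselines = decade_baselines(records)
outliers = flag_outliers(records)
print(baselines, outliers)
```

Deviations are flagged in both directions because a low score may point to digitization errors, while an unusually high score for the period may point to later editing.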
We can also use this information to compare duplicate copies of the same text, and perhaps to select the copy with the higher digitization quality, or to check whether a text has been edited if its score is unusually high for its period.
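Selecting among duplicate copies by recognition score could look something like this. It is a hypothetical sketch: the tuple layout `(publication_id, file_id, share)` and the function name `pick_best_copies` are assumptions, not the corpus metadata schema.

```python
def pick_best_copies(copies):
    """copies: iterable of (publication_id, file_id, share).
    Return {publication_id: (file_id, share)} for the best-scoring copy."""
    best = {}
    for pub_id, file_id, share in copies:
        if pub_id not in best or share > best[pub_id][1]:
            best[pub_id] = (file_id, share)
    return best


# Toy identifiers and scores, for illustration only.
copies = [
    ("p1", "a", 0.62),
    ("p1", "b", 0.81),
    ("p2", "c", 0.40),
]
best = pick_best_copies(copies)
print(best)
```

The highest score is not automatically the best choice: a copy that scores well above what is typical for its period may have been edited towards modern orthography, so such cases are worth checking by hand.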
Figure 5. Printing locations over time
Figure 6. Top book printing location until now
Figure 9. Top publishers in 1800-1940